A Method of Chinese-Vietnamese Bilingual Corpus Construction for Machine Translation
Abstract
A bilingual corpus is vital for natural language processing problems, especially in machine translation. The larger and better in quality the corpus is, the higher the efficiency of the resulting machine translation is. There are two popular approaches to building a bilingual corpus. The first is to build a corpus automatically from resources that are available on the internet, typically bilingual websites. The second approach is to construct a corpus manually. Automated construction methods are being used more frequently because they are less expensive and there is a growing number of bilingual websites to exploit. In this paper, we use automated collection methods on a bilingual website to create a Chinese-Vietnamese corpus. In particular, we collect data from the website of a multilingual dictionary (https://glosbe.com). The bilingual corpus we collected from this website includes more than 400k sentence pairs. We chose 100,000 sentence pairs for the experiments. From this corpus, we built five datasets consisting of 20k, 40k, 60k, 80k, and 100k sentence pairs, respectively. In addition, we built five additional datasets by applying word segmentation to the sentences of the original datasets. The experimental results showed that: (1) the corpus is relatively good, with a highest BLEU score of 19.8, although there are still some issues that need to be addressed in future works; (2) the larger the corpus is, the higher the machine translation quality is; and (3) the untokenized datasets help train better translation models than the tokenized datasets.
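The dataset construction described in the abstract (five datasets of increasing size drawn from one pool of sentence pairs) could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say whether the smaller datasets are subsets of the larger ones or how pairs were sampled, so the nested-prefix design, the function name `build_datasets`, the `seed` parameter, and the toy data are all hypothetical.

```python
import random


def build_datasets(pairs, sizes, seed=0):
    """Return {size: subset}, where each subset is a prefix of one shuffled copy.

    Assumption (not stated in the abstract): the datasets are nested, so the
    smallest dataset is contained in every larger one.
    """
    if max(sizes) > len(pairs):
        raise ValueError("largest size exceeds the number of available pairs")
    shuffled = list(pairs)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    return {n: shuffled[:n] for n in sorted(sizes)}


# Toy stand-in for the corpus: 10 placeholder (Chinese, Vietnamese) pairs.
# The sizes in the abstract would be [20_000, 40_000, 60_000, 80_000, 100_000].
toy_pairs = [(f"zh-{i}", f"vi-{i}") for i in range(10)]
datasets = build_datasets(toy_pairs, sizes=[2, 4, 6, 8, 10])
```

Nesting the subsets keeps the comparison across sizes focused on corpus size rather than on which sentences happened to be sampled; independent random samples would be an equally plausible reading of the abstract.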
Similar resources
POS-Tagger for English-Vietnamese Bilingual Corpus
Corpus-based Natural Language Processing (NLP) tasks for such popular languages as English, French, etc. have been well studied with satisfactory achievements. In contrast, corpus-based NLP tasks for unpopular languages (e.g. Vietnamese) are at a deadlock due to absence of annotated training data for these languages. Furthermore, hand-annotation of even reasonably well-determined features such ...
Construction of Chunk-Aligned Bilingual Lecture Corpus for Simultaneous Machine Translation
With the development of speech and language processing, speech translation systems have been developed. These studies target spoken dialogues, and employ consecutive interpretation, which uses a sentence as the translation unit. On the other hand, there exist a few researches about simultaneous interpreting, and recently, the language resources for promoting simultaneous interpreting r...
Automatic Construction of Translation Knowledge for Corpus-based Machine Translation
Many machine translation (MT) systems that utilize the knowledge automatically acquired from bilingual corpora have been proposed in conjunction with efforts to accumulate corpora. We call this approach corpus-based machine translation in this thesis. This thesis focuses on automatic construction of the translation knowledge needed for corpus-based MT and discusses the following three tasks. 1....
Block Analysis of Bilingual Corpus for Chinese-English Statistical Machine Translation
In this paper, we describe a bilingual corpus processing strategy, block analysis, from a new point of view. By this analysis strategy, we want to extract more information from bilingual corpus for future statistical machine translation. At first, we define some block types and give some statistical data from a Chinese-English bilingual corpus under this framework. Then a block-based alignment ...
Vietnamese to Chinese Machine Translation via Chinese Character as Pivot
Using Chinese characters as an intermediate equivalent unit, we decompose machine translation into two stages, semantic translation and grammar translation. This strategy is tentatively applied to machine translation between Vietnamese and Chinese. During the semantic translation, Vietnamese syllables are one-by-one converted into the corresponding Chinese characters. During the grammar transla...
Journal
Journal title: IEEE Access
Year: 2022
ISSN: 2169-3536
DOI: https://doi.org/10.1109/access.2022.3186978